
decoder module to low bits, i.e., (1)+(2)+(3), brings the most significant accuracy drop among all parts of the DETR methods, up to 2.1% in the 3-bit DETR-R50. At the same time, the other parts of DETR are comparatively robust to the quantization function. Consequently, the critical problem in improving quantized DETR methods is restoring the information in the MHA modules after quantization. The qualitative results in Fig. 2.8 and Fig. 2.9 further indicate that the degraded information representation is the main obstacle to a better quantized DETR.

2.4.3 Information Bottleneck of Q-DETR

To address the information distortion of the quantized DETR, we aim to improve the representation capacity of the quantized network within a knowledge distillation framework. Generally, we utilize a real-valued DETR as the teacher and a quantized DETR as the student, distinguished by the superscripts T and S.

Our Q-DETR pursues the best trade-off between performance and compression, which is precisely the goal of the information bottleneck (IB) method: quantifying the mutual information that the intermediate layer retains about the input (less is better) and about the desired output (more is better) [210, 223]. In our case, the intermediate layer comes from the student, while the desired output includes the ground-truth labels as well as the queries of the teacher for distillation. Thus, the objective of our Q-DETR is:

$$\min_{\theta^S} \; I(X; E^S) - \beta I(E^S, q^S; y^{GT}) - \gamma I(q^S; q^T), \qquad (2.27)$$

where $q^T$ and $q^S$ represent the queries in the teacher and student DETR methods as predefined in Eq. (2.26); $\beta$ and $\gamma$ are the Lagrange multipliers [210]; $\theta^S$ denotes the parameters of the student; and $I(\cdot;\cdot)$ returns the mutual information between two input variables. The first term $I(X; E^S)$ minimizes the information between the input and the visual features $E^S$ to extract task-oriented hints [240]. The second term $I(E^S, q^S; y^{GT})$ maximizes the information between the extracted visual features and the ground-truth labels for better object detection. These two terms can easily be accomplished by common network training and detection loss constraints, such as proposal classification and coordinate regression.
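For illustration only, a minimal sketch of how these two task-related terms, together with the query term discussed next, could be folded into a single training loss; the PyTorch-style function below and the simple MSE surrogate for the query term are assumptions made for exposition, not the exact procedure derived in this section:

```python
import torch.nn.functional as F

def q_detr_loss_sketch(cls_loss, box_loss, q_student, q_teacher,
                       beta=1.0, gamma=1.0):
    # Task-related terms of Eq. (2.27): the standard detection losses
    # (proposal classification + coordinate regression) stand in for
    # maximizing I(E^S, q^S; y_GT); I(X; E^S) is left implicit here.
    task_term = beta * (cls_loss + box_loss)

    # Hypothetical surrogate for maximizing I(q^S; q^T): pull the student
    # queries toward the frozen teacher queries.
    distill_term = gamma * F.mse_loss(q_student, q_teacher.detach())

    return task_term + distill_term
```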

The core issue of our approach is to solve the third term $I(q^S; q^T)$, which attempts to address the information distortion in the student query by introducing the teacher query as prior knowledge. To accomplish our goal, we first expand the third term and reformulate it as:

$$I(q^S; q^T) = H(q^S) - H(q^S \mid q^T), \qquad (2.28)$$
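This is the standard decomposition of mutual information into entropy terms; spelled out with the intermediate joint-entropy step, it reads:

```latex
% Standard identity behind Eq. (2.28)
I(q^S; q^T) = H(q^S) + H(q^T) - H(q^S, q^T)
            = H(q^S) - H(q^S \mid q^T).
```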

where $H(q^S)$ is the self-information entropy, which is expected to be maximized, while $H(q^S \mid q^T)$ is the conditional entropy, which is expected to be minimized. It is challenging to optimize these maximum and minimum terms simultaneously. Instead, we make a compromise and reformulate Eq. (2.28) as a bi-level optimization problem [152, 46] that alternately optimizes the two terms, explicitly defined as:

$$\min_{\theta^S} \; H(q^S \mid q^T), \quad \text{s.t.} \quad q^S = \arg\max_{q^S} H(q^S). \qquad (2.29)$$

Such an objective involves two sub-problems: an inner-level optimization to derive the current optimal query $q^S$, and an upper-level optimization to conduct knowledge transfer from the teacher to the student. Below, we show that the two sub-problems can be solved in the forward and backward propagation of the network, respectively.
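As a rough schematic of this alternating structure, the sketch below treats the inner level as a placeholder applied to the student queries during the forward pass and realizes the upper level as a simple distillation penalty in the backward pass; all names are hypothetical, and the L2 surrogate for $H(q^S \mid q^T)$ rests on a Gaussian assumption rather than on the actual mechanism described next.

```python
import torch
import torch.nn.functional as F

def maximize_query_entropy(q_student):
    # Placeholder for the inner-level problem arg max_{q^S} H(q^S); left as an
    # identity mapping here purely to expose the bi-level structure.
    return q_student

def bilevel_step(student, teacher, images, targets, detection_loss,
                 optimizer, gamma=1.0):
    # Inner level (forward pass): derive the current "optimal" student queries.
    out_s, q_student = student(images)
    q_star = maximize_query_entropy(q_student)

    # Teacher queries serve as prior knowledge and are not updated.
    with torch.no_grad():
        _, q_teacher = teacher(images)

    # Upper level (backward pass): minimize a surrogate of H(q^S | q^T)
    # (an L2 penalty under a Gaussian assumption) jointly with the detection
    # loss that covers the task-related terms of Eq. (2.27).
    loss = detection_loss(out_s, targets) + gamma * F.mse_loss(q_star, q_teacher)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```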